We investigate correlations in information carriers, e.g. texts and pieces of music, which are
represented by strings of letters. For information-carrying strings generated by one source (e.g. a
novel or a piece of music) we find correlations on many length scales. The word distribution, the
higher order entropies and the transinformation are calculated. The analogy to strings generated
through symbolic dynamics by nonlinear systems in critical states is discussed.



I. INTRODUCTION 

In the manifold of structures which are used as infor- 
mation carriers in nature, culture and engineering, linear 
strings consisting of sequences of letters play a central 
role. This may be demonstrated by the following examples:
The proteins and polynucleotides, the main information
carriers in living systems, are sequences of amino acids
and/or nucleotides. Furthermore, most of the messages
transporting information between human informational
systems have the form of strings of letters. Examples are
books or letters, music, computer programs etc. By using
the methods of symbolic dynamics any trajectory of a
dynamic system may be mapped to a string of letters on a
certain alphabet [24]. Hence each sequence can be
interpreted as the trajectory of a discrete dynamic system.

This work is devoted to the investigation of strings of 
the type characterized above, i.e. to objects (documents, 
programs etc.) which may be mapped to strings of let- 
ters. The main aim of the investigation is the analysis of 
long range correlations in information carrying strings. 
For several reasons we expect the existence of long range
structures in these sequences. In particular we expect
correlations which range from the beginning of a string
up to its end. Let us discuss some of the reasons for this
behavior:

1. Predictability: We know from our everyday experience
and from scientific research that the identification of the
first hundred or thousand letters of a string already tells
us a lot about its continuation. Often we make a decision
in a book shop after reading just one page. For example, if
we find there the words love and tennis several times, we
expect to find them again and again on the other pages, but
if we first find words like file and program we expect to
remain in quite another field. Listening to the beginning
of a Bach praeludium where the general theme is worked out,
we expect to hear it again in many variations up to the
very end. Such expectations are only justified if long
range correlations indeed exist. This is the scientific
expression of our intuitive expectation, based on
experience, that texts and music have a certain inherent
predictability.

2. Syntactical limitations: Another heuristic reason to
expect long range correlations is the exponential explosion
of the variety of possible subwords with increasing length
for uncorrelated strings. Uncorrelated sequences generated
on an alphabet of $\lambda$ letters have a variety of

$$N(n) = \lambda^{n} = \exp(\ln\lambda \cdot n) \qquad (1)$$

different subwords of length n. A subword is here any
combination of letters including the space, punctuation
marks etc. Strings of this type may be generated by
stochastic processes of Bernoulli type or by fully
developed chaotic dynamics using alphabets of $\lambda$
letters. For n > 100 the number N(n) is extremely large. In
other words, we need very sharp restrictions to select a
meaningful subset out of it. Long range correlations
provide such a selection criterion. Denoting the size of
the selected subset by $N^{*}(n)$ we expect

$$\frac{N^{*}(n)}{N(n)} \ll 1 \quad \text{if} \quad n \gg 1 . \qquad (2)$$

Bounds of this kind are given by syntax and semantics. The
syntactical rules do not allow for an arbitrary
concatenation of words to sentences; most of them are
forbidden. Furthermore we know that texts (and pieces of
music) are formed by keywords (motifs) which are the raw
material for the generation of a text (a piece of music).
In fact all these rules lead to slower growth with n.

A rather sharp restriction on the growth with n
corresponds to the power law

$$N^{*}(n) \sim n^{\alpha} . \qquad (3)$$

Symbolic strings generated by the logistic map at the
Feigenbaum point have this scaling property. The conjecture
that several natural objects have this type of scaling has
been made earlier. We will show here that pieces of music
possibly belong to this class. Another growth law, which is
much faster, is the stretched exponential law

$$N^{*}(n) \sim \exp(C n^{\alpha}) \quad \text{with} \quad \alpha < 1 . \qquad (4)$$

This scaling, which is typical for intermittent processes,
was observed for several texts. We mention that in the
limit $\alpha \to 0$ this law corresponds to the scaling in
eq. (3). The reduction due to the scaling rule (4) is not
as strong as that due to the scaling rule (3); however, the
reduction factor corresponding to rule (4) is still
enormous for large n.

3. Evolution: The third reason to expect such behavior is
the rather general idea that evolution operates in regions
where long relaxation times, 1/f-noise and other long range
correlations are essential.
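The contrast between the full variety of eq. (1) and the restricted number of subwords that actually occur can be made concrete by counting the different blocks of length n in a string. The following Python sketch is only an illustration under our own assumptions: the function name `distinct_subwords` and the toy sample string are hypothetical and not part of the original analysis.

```python
def distinct_subwords(text: str, n: int) -> int:
    """Number of different subwords (blocks) of length n that occur in text."""
    return len({text[i:i + n] for i in range(len(text) - n + 1)})


# Toy sample (an assumption for illustration, not one of the analysed texts).
sample = "the whale and the sea and the whale and the ship " * 20
alphabet_size = len(set(sample))

for n in (1, 2, 4, 8):
    observed = distinct_subwords(sample, n)
    upper_bound = alphabet_size ** n  # variety of uncorrelated strings, eq. (1)
    print(n, observed, upper_bound)
```

For a strongly repetitive sample like this one the observed count stays far below the combinatorial bound, which is the qualitative behavior expressed by eq. (2).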



II. ENTROPY-LIKE MEASURES OF SEQUENCE 
STRUCTURE 

In the following section we will study entropy-like
quantities as a measure of the long range correlations in
sequences. In order to describe the structure of a given
string of length L using an alphabet of $\lambda$ letters
$\{A_1 A_2 \ldots A_\lambda\}$ we introduce the following
notation: Let $A_1 A_2 \ldots A_n$ be the letters of a
given substring of length n < L. Furthermore let

$$p^{(n)}(A_1 \ldots A_n) \qquad (5)$$

be the probability to find in a string a block of length n
(subword of length n) with the letters $A_1 \ldots A_n$.
The probability of having a pair with (n − 2) arbitrary
letters in between we denote by

$$p^{(n)}(A_1, A_n) = \sum_{A_2 \ldots A_{n-1}} p^{(n)}(A_1 A_2 \ldots A_n) . \qquad (6)$$

We introduce the following quantities:



1. the mutual information, also called transinformation,

$$I(n) = \sum_{A_i, A_j} p^{(n)}(A_i, A_j) \log\left(\frac{p^{(n)}(A_i, A_j)}{p^{(1)}(A_i) \cdot p^{(1)}(A_j)}\right) , \qquad (7)$$

2. the entropy per subword of length n

$$H_n = -\sum p^{(n)}(A_1 \ldots A_n) \log p^{(n)}(A_1 \ldots A_n) , \qquad (8)$$

3. the uncertainty of the letter following a block of
length n

$$h_n = H_{n+1} - H_n , \qquad (9)$$

4. the entropy of the source (related to the
Kolmogorov-Sinai entropy)

$$h = \lim_{n \to \infty} h_n . \qquad (10)$$
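The block entropies and the uncertainties defined above can be estimated from a finite string by replacing the probabilities with relative block frequencies. The following Python sketch illustrates this under stated assumptions: the function names are hypothetical, the entropies are expressed in units of the logarithm of the alphabet size, and no finite-length correction is applied.

```python
import math
from collections import Counter


def block_entropy(text: str, n: int, alphabet_size: int) -> float:
    """Plug-in estimate of H_n in units of log(alphabet_size)."""
    total = len(text) - n + 1
    counts = Counter(text[i:i + n] for i in range(total))
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(alphabet_size)


def letter_uncertainty(text: str, n: int, alphabet_size: int) -> float:
    """Estimate of h_n = H_{n+1} - H_n, the uncertainty of the next letter."""
    return (block_entropy(text, n + 1, alphabet_size)
            - block_entropy(text, n, alphabet_size))


# A periodic toy string: the block entropy saturates once n reaches the period.
periodic = "abc" * 200
print([round(block_entropy(periodic, n, 3), 3) for n in (1, 2, 3, 4)])
print(round(letter_uncertainty(periodic, 3, 3), 3))
```

For real texts the estimate becomes unreliable as soon as the number of possible blocks is comparable to the length of the string, so the sketch is only meaningful for small n.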



In an earlier paper we assumed the scaling behavior

$$H_n = n \cdot h + g \cdot n^{\mu_0} \cdot (\log n)^{\mu_1} + e , \qquad 0 \le \mu_0 < 1 \ \text{or} \ \mu_0 = 1 , \ \mu_1 < 0 . \qquad (11)$$

Related assumptions have been made independently by
several authors [12], [15, 27]. From that point of view the
long range order of strings may be well characterized by
the asymptotics for large n. Chaotic and stochastic strings
of the standard type have the property h > 0. Special cases
are Bernoulli processes or fully developed chaos with

$$p^{(n)}(A_1 \ldots A_n) = \lambda^{-n} , \qquad H_n = n \cdot \log\lambda , \qquad h_n = \log\lambda . \qquad (12)$$



In the following we write all entropy measures in units of
$\log\lambda$. For Markov processes with memory m the
uncertainty decreases during the first m steps and then
remains constant,

$$h_{m+k} = h , \qquad k \ge 0 . \qquad (13)$$

On the other hand, any string with a finite period p is
characterized by

$$H_{p+k} = \text{const.} , \qquad h_{p+k} = 0 \quad \text{if} \quad k \ge 0 . \qquad (14)$$



This means that any new letter added to a string does not
increase the complexity of the sequence. Consequently we
find for periodic strings h = 0 and g = 0. Of special
interest to our further considerations are systems which
are neither periodic nor chaotic and which are
characterized by

$$g > 0 , \qquad h < 1 . \qquad (15)$$



Presently there are not enough data available to estimate
the limit entropy h for homogeneous texts (written by one
author). We follow in this respect the seminal work by
Claude E. Shannon who concluded in his pioneering paper:
"From this analysis it appears that, in ordinary literary
English, the long range statistical effects (up to 100
letters) reduce the entropy...". Shannon gave an estimate
of 40 bits for the entropy of n = 100 letters, i.e. about
0.4 bit per letter. Transforming the bits to
$\lambda$-units (which we use throughout this paper) with
$\lambda = 32$, i.e. 5 bits per unit, we get $H_{100} = 8$
and we find for the entropy per letter the value 0.08.
According to several general investigations it is not
likely that the uncertainty (entropy per letter) decreases
still further for larger n. Based on these considerations
we assume in the following that the limit entropy (in
$\lambda$-units) is in the region

$$0.01 < h < 0.1 . \qquad (16)$$
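The conversion of Shannon's estimate from bits to units of the logarithm of the alphabet size is a one-line calculation. The short Python sketch below merely repeats this arithmetic for the 32-symbol alphabet; the function name `bits_to_alphabet_units` is a hypothetical helper introduced for illustration.

```python
import math


def bits_to_alphabet_units(entropy_bits: float, alphabet_size: int) -> float:
    """Convert an entropy given in bits to units of log(alphabet_size)."""
    return entropy_bits / math.log2(alphabet_size)


h_100 = bits_to_alphabet_units(40.0, 32)   # Shannon's 40 bits for 100 letters
print(h_100, h_100 / 100)                  # block entropy and entropy per letter
```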



Since a reliable estimate seems to be impossible at present
we simply neglect the contribution $n \cdot h$ to the block
entropy in eq. (11). In this way we obtain for the
intermediate region $1 \ll n \ll 30$ the formula which will
be the basis for our further investigations,

$$H_n = g \cdot n^{\mu_0} \cdot (\ln n)^{\mu_1} + e . \qquad (17)$$

Special cases are:

• logarithmic tails

$$H_n = g \cdot (\ln n)^{\mu_1} + e , \qquad \mu_1 > 0 \qquad (18)$$

• power law tails

$$H_n = g \cdot n^{\mu_0} + e , \qquad 0 < \mu_0 < 1 \qquad (19)$$

The working hypothesis developed earlier is that strings
characterized by eqs. (18) or (19), being on the borderline
between order and chaos, might be prototypes of
information-carrying sequences.

Following a relation derived by McMillan and Khinchin, the
n-letter entropy and the mean number of subwords of length
n are related by

$$N^{*}(n) = \lambda^{H_n} . \qquad (20)$$



In this way a logarithmic entropy scaling corresponds to 
a power law of the numbers of subwords and a power law 
scaling of the entropy corresponds to a stretched expo- 
nential growth of the number of subwords. 
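The correspondence between entropy scaling and subword growth can be checked numerically. The sketch below assumes the relation between entropy and subword number in the form N* = (alphabet size) to the power H_n, with H_n measured in units of the logarithm of the alphabet size; the function names and the parameter values g, e and the exponent 0.5 are illustrative assumptions only.

```python
import math


def subword_count(h_n: float, alphabet_size: int) -> float:
    """N*(n) from the block entropy H_n (in units of log(alphabet_size))."""
    return alphabet_size ** h_n


def log_entropy(n: float, g: float, e: float) -> float:
    """Logarithmic entropy scaling H_n = g * ln(n) + e."""
    return g * math.log(n) + e


def power_entropy(n: float, g: float, mu0: float, e: float) -> float:
    """Power law entropy scaling H_n = g * n**mu0 + e."""
    return g * n ** mu0 + e


lam, g, e = 3, 2.0, 1.0   # illustrative values, not fitted to any data
for n in (4, 16, 64):
    # Logarithmic scaling gives the power law N* = lam**e * n**(g * ln(lam)).
    power_law = subword_count(log_entropy(n, g, e), lam)
    # Power law scaling gives the stretched exponential
    # N* = lam**e * exp(g * ln(lam) * n**mu0).
    stretched = subword_count(power_entropy(n, g, 0.5, e), lam)
    print(n, power_law, stretched)
```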



The mutual information (transinformation) defined by
eq. (7) is not monotonically decreasing with increasing n.
We define here long range tails as any non-exponential
decay or increase of the averaged transinformation I(n).
Periodic strings show correlations of infinitely long
range. Periodicity with the period p implies that all
conditional probabilities
$p^{(n)}(A_i, A_j)/p^{(1)}(A_i)$ for n > 0 are also
periodic. This leads to a periodic behavior of the
transinformation. Therefore sequences with long range
correlations show a fluctuating behavior at large scales
[22].
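The periodic behavior of the transinformation for periodic strings can be illustrated with a small numerical experiment. The following Python sketch estimates the mutual information of eq. (7) from pair frequencies. It is an illustration under our own assumptions: the function name is hypothetical, and the argument `distance` is the separation of the two positions in the string, which corresponds to n − 1 in the notation of eq. (6).

```python
import math
from collections import Counter


def mutual_information(text: str, distance: int, alphabet_size: int) -> float:
    """Plug-in estimate of the transinformation between letters that are
    `distance` positions apart, in units of log(alphabet_size)."""
    pairs = Counter(zip(text, text[distance:]))
    n_pairs = len(text) - distance
    singles = Counter(text)
    n_single = len(text)
    info = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n_pairs
        p_a = singles[a] / n_single
        p_b = singles[b] / n_single
        info += p_ab * math.log(p_ab / (p_a * p_b))
    return info / math.log(alphabet_size)


# Toy string with period 3: the estimate repeats with the same period.
periodic = "aab" * 200
print([round(mutual_information(periodic, d, 2), 3) for d in range(1, 8)])
```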



III. MUTUAL INFORMATION AND WORD 
DISTRIBUTIONS FOR FINITE 
INFORMATION-CARRYING STRINGS 



Printed texts in natural languages and music written in the
language of notes are examples of information-carrying
strings. Other examples, which we do not consider here, are
biosequences, where some evidence for the existence of long
range correlations was found [17, 29]. Originally texts or
pieces of music were generated by the writer or composer as
a dynamical process in real time. Today we find in books or
albums the frozen-in results of this process in the form of
a symbolic sequence. Certainly texts or pieces of music are
symbolic sequences of high complexity. In contrast to other
dynamical processes, writing and composing have developed
during a long evolution, being intended for communication
between human beings. In spite of all these difficulties
let us now follow the greatly simplifying assumption due to
Shannon, Mandelbrot and others that texts and pieces of
music may be considered as the results of a stationary
random process. Although this assumption is still
controversial we will take it here as a basis for the
further analysis. In our analysis we considered the
following sequences:



1. Sonata for pianoforte op. 31 No. 2 by L. v. Beethoven
(L ≈ 4,920)

2. Moby Dick by H. Melville (L ≈ 1,170,200)

3. Grimm's Fairy Tales by the Brothers Grimm
(L ≈ 1,435,800)

Furthermore a few comparisons were made with the Praeludium
in F major by J. S. Bach and with the sonata KV 311 by
W. A. Mozart. In the case of texts we used an alphabet
consisting of the 32 symbols

a b c d e f ... x y ... # ~

The last symbol ~ stands for the empty space and # for any
number. In the case of music we encoded the notes of 2½
octaves, beginning with the low A and ending with the high
D, on an alphabet with 32 symbols. The white keys of the
pianoforte were encoded by the 18 symbols

AHCDEFGahcdefgmopr

and the black keys, beginning with the lower B flat (Be)
and ending with the high C sharp (Cis), were encoded by the
12 symbols

BIJKLbijklnq

The pause was encoded by the underscore "_" and the holding
of a tone by one further symbol. In order to get better
statistics we also used compressed alphabets consisting of
the 3 letters O, M, L only. The letter O codes for vowels
in the case of texts or for a move down in the case of
music, the letter M stands for consonants or moves up, and
the letter L codes for all other symbols, e.g. pauses
(spaces), holding the tone and punctuation marks. For the
analysis of the mutual information we have to count here
the frequencies of pairs. Since the number of different
pairs is $32^2 = 1024$ we have rather good statistics if
the length L of the string satisfies the inequality
$L \gg 1024$. As shown by several authors [22], the
transinformation is a reliable measure of the correlations
of letters at the distance n. Every peak at n in the
transinformation corresponds to a strong positive
correlation. In Fig. 1 we show the pair correlations of the
Beethoven sonata and of music by Bach and Mozart. The peaks
show that there exist strong correlations between two notes
at characteristic distances. The interpretation of these
peaks must be left to specialists. We further notice some
similarity in the correlation structure of Bach's and
Beethoven's music and a distinctly different structure of
Mozart's music. A more detailed study of the differences
between composers will be given in a separate paper.
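The compression of a text to the 3-letter alphabet O, M, L can be sketched in a few lines of Python. The sketch is an assumption-laden illustration, not the original encoding program: the function name `compress` is hypothetical, the vowel set is taken to be a, e, i, o, u (so that y counts as a consonant), and every non-alphabetic character, including digits, is mapped to L.

```python
VOWELS = set("aeiou")   # assumed vowel set; y is treated as a consonant


def compress(text: str) -> str:
    """Map a text to the compressed alphabet: O vowel, M consonant, L other."""
    out = []
    for ch in text.lower():
        if ch in VOWELS:
            out.append("O")
        elif ch.isalpha():
            out.append("M")
        else:
            out.append("L")
    return "".join(out)


print(compress("Call me Ishmael."))   # MOMMLMOLOMMMOOML
```

With only 3 symbols the number of different pairs drops from 1024 to 9, which is why much shorter strings, such as a single sonata, still give usable statistics.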
FIG. 1. The mutual information calculated for
Beethoven's Sonata op. 31 No. 2. For comparison the
results for pieces by Bach and Mozart from other sources
[Ebeling, 1993] [Albrecht et al., 1994] are given.

FIG. 2. The mutual information calculated for Moby
Dick and for Grimm's Tales ($\lambda = 32$).

FIG. 3. The mutual information ($I(n)$, $\lambda = 32$)
calculated for the range $n \in [20; 1000]$ for Moby Dick.
The fluctuation level due to the finite length of the text
(L = 1,170,200) is 0.00012.

In Figs. 2 and 3 the mutual information calculated for Moby
Dick and for Grimm's Tales ($\lambda = 32$) is drawn. The
results show that there are well expressed correlations in
the range n = 1 ... 25 which are followed by a long, slowly
decaying tail. The results for the transinformation I(n)
become meaningless if the values are smaller than the level
of natural fluctuations due to the finite length L of the
text, which is

$$\delta I = \frac{\lambda^2 - 2 \cdot \lambda}{2 \cdot L \cdot \ln\lambda} . \qquad (21)$$

Although the fluctuation level decays with 1/L, it has even
for the rather long text Moby Dick with L = 1,170,200 a
value of about 0.00012. This means, as seen from Fig. 3,
that any conclusions suggesting long range correlations
beyond $n \approx 300$ are rather problematic. However, the
range of studies of I(n) may still be extended by using
length corrections [14]. An alternative method is based on
studies of the dependence of I(n) on 1/L (which presumably
is linear) and on extrapolation to $L \to \infty$.
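The fluctuation level quoted for Moby Dick can be reproduced directly from the finite-length formula (alphabet size squared minus twice the alphabet size, divided by twice the text length times the natural logarithm of the alphabet size). The following Python sketch evaluates it; the function name `fluctuation_level` is a hypothetical helper.

```python
import math


def fluctuation_level(alphabet_size: int, length: int) -> float:
    """Finite-length fluctuation level of the transinformation, eq. (21)."""
    lam = alphabet_size
    return (lam ** 2 - 2 * lam) / (2 * length * math.log(lam))


print(fluctuation_level(32, 1_170_200))   # close to 0.00012 for Moby Dick
print(fluctuation_level(32, 4_920))       # a string as short as the sonata
```

The second call shows why the 32-symbol alphabet is of limited use for the sonata: for a string of a few thousand letters the fluctuation level is more than two orders of magnitude larger.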



As we see from Figs. 2 and 3 the mutual information
decreases monotonically up to $n \approx 300$ and merges
into the fluctuation level. There are no well expressed
correlation peaks. Evidently long texts are in this respect
less correlated than DNA sequences, where well expressed
long range pair correlations have been found.

The results obtained so far are based only on the
statistical distributions of pairs. In this case one can
reach rather good statistics by counting the frequencies of
pairs. Let us study now the distribution of words of length
n > 2. Due to the fact that the number of different words
of length n using an alphabet consisting of 32 letters is

$$N(n) \sim 32^{n} = 2^{5 \cdot n} \qquad (22)$$

there are for n = 9 already $2^{45}$ words with
approximately $2^{50}$ letters. This is much more than all
texts stored in libraries, which have been estimated to
consist of about $10^{15}$ letters. Therefore we could not
do a statistical analysis if there were no additional
constraints due to grammar and semantics. According to an
earlier estimate we expect that the growth law scales as

$$N^{*} \sim \exp(C \cdot n^{\alpha}) \qquad (23)$$

with $\alpha \approx 1/2$ for texts.

We have done the analysis with two long texts, namely Moby
Dick and Grimm's Tales. Figs. 4 and 5 show the rank-ordered
distribution of subwords of length n = 4, 9, 16, 25. The
structures of both distributions are similar but the lists
of words are quite different. For example, among the most
frequent subwords of length n = 16 in the case of Moby Dick
are "the_sperm_whale_" and "the_quarter_deck". For Grimm's
Tales rather frequent subwords of length n = 25 are e.g.
"_if_i_could_but_shudder._" and
"_princess,_open_the_door_".

FIG. 4. The observed rank-ordered distribution of
words of length n = 4, 9, 16 for Moby Dick.

FIG. 5. The observed rank-ordered distribution of
words of length n = 4, 9, 16, 25 for Grimm's Tales.

The shapes of the subword distributions are distinctly not
Zipf-like; they do not follow a power law. The distribution
tends to form a Fermi-like plateau. This tendency is based
on the theorem of asymptotic equipartition derived by
McMillan and Khinchin. This theorem proves that for
$n \to \infty$ the asymptotic form of the distribution is
rectangular, i.e.

$$Z(j) = \begin{cases} 1/N^{*}(n) & \text{if } j \le N^{*}(n) \\ 0 & \text{else} . \end{cases} \qquad (24)$$



The effects due to finite n and to the finite text length L
tend to smooth the edges of this distribution. Since our
texts are rather long, the finite size effects do not have
too much influence on the distributions. Much more
difficult is the analysis of relatively short pieces of
music. Here additional problems arise due to the short
sample. For example Beethoven's sonata consists of only
4,920 letters (notes). The importance of length corrections
for estimating the frequencies of words was considered by
several authors [25]. For a deeper analysis of this problem
we refer to [14]. The method we used here was found by
generalizing a method proposed earlier. According to this
method the unknown distribution function of the words is
guessed by a comparison of expected (generated) and
observed distributions. Instead of the simple rectangular
distribution in eq. (24) we applied a more realistic
expression. For given word length n we guess the true (not
normalized) distribution, i.e. the distribution for
$L \to \infty$, in the form



$$Z(j, n) = Z_0(j, n) + Z_1(j, n) + Z_2(j, n) \qquad (25)$$

with

$$Z_0(j, n) = \frac{z_0(n)}{1 + \exp\left(b \cdot (j - j_0(n))\right)} \qquad (26)$$

$$Z_1(j, n) = z_1(n) \cdot \exp\left(-\frac{j}{j_1(n)}\right) \qquad (27)$$

$$Z_2(j, n) = \left(z_2(n) - z_1(n) - z_0(n)\right) \cdot \ldots \qquad (28)$$



This distribution has 7 free parameters, one of which is
fixed by the condition that the total number of words is
given by the size of the text. For a string of length L the
total number of n-words is N = L − n + 1. Each of the
parameters has a simple meaning:

$z_2(n)$ - frequency of the most frequent word of length n,
$z_1(n)$ - height of the exponential contribution at j = 1,
$z_0(n)$ - height of the asymptotic "Fermi" plateau,
$j_2(n)$ - number of frequent words,
$j_1(n)$ - number of relatively frequent words,
$j_0(n)$ - number of different words,
$b(n)$ - reciprocal "Fermi" temperature of the plateau.
FIG. 6. Rank-ordered word distribution for Beethoven's
sonata (n = 8, $\lambda = 3$): observed frequency (full
line), generated frequency (dashed line) and theoretical
curve (dotted line).



All these parameters characterize certain properties of the
assumed generating process. For complex processes such as
the writing of novels or pieces of music the parameters are
of course unknown. Due to their simple meaning, however, it
is easy to guess the initial values of the parameters. The
next step in an iterative process is then to generate a
sample of n-words (with the help of the guessed probability
distribution Z(j, n)) which has the same number of n-words
as the text, and to transform it to a rank-ordered
distribution. To overcome problems of low statistics we
average over several generated distributions. The result of
this procedure is called the expected distribution
$Z^{exp}(j, n, L)$ according to the chosen set of
parameters. The parameters are fitted by adaptation of the
expected to the empirical (observed) distribution
$Z^{obs}(j, n, L)$. The procedure is iterated until
sufficient agreement between the observed and the expected
distributions is reached:

$$F(n) = \sum_{j} \left[Z^{obs}(j, n, L) - Z^{exp}(j, n, L)\right]^2 \to \min . \qquad (29)$$

FIG. 7. Rank-ordered distribution for Beethoven's sonata
(n = 8, $\lambda = 32$): observed frequency (full line),
generated frequency (dashed line) and theoretical curve
(dotted line).



alphabet   z_2(8)   z_1(8)   z_0(8)   j_2(8)   j_1(8)   j_0(8)   b(8)    F(8)
3          241      37.5     2.54     1.44     41.4     908      0.28    2386
32         19.4     9.82     1.25     2.8      10.8     3733     0.317   1381

TABLE I. Parameters of the distribution of words
(sequences of notes) of length n = 8 for Beethoven's Sonata.



In the minimization procedure we used the simplex method,
applying the program package MINUIT developed for
non-linear parameter optimization. For word length n = 8
the optimized parameters are given in Tab. I. The
corresponding distributions $Z(j, n)$, $Z^{exp}(j, n, L)$
and $Z^{obs}(j, n, L)$ are presented in Figs. 6 and 7.
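The generation step of the fitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the program used for the analysis: all function names are hypothetical, only the plateau term of eq. (26) is used as a stand-in for the full guessed distribution, the parameter values are arbitrary, and the minimization itself (performed with MINUIT in the text) is not reproduced.

```python
import math
import random
from collections import Counter


def fermi_plateau(j, z0, j0, b):
    """Plateau term of eq. (26); used alone here as a stand-in for Z(j, n)."""
    return z0 / (1.0 + math.exp(b * (j - j0)))


def expected_rank_distribution(weights, n_words, n_samples, rng):
    """Average rank-ordered frequencies over n_samples synthetic samples,
    each containing n_words words drawn according to the guessed weights."""
    ranks = range(len(weights))
    total = [0.0] * len(weights)
    for _ in range(n_samples):
        counts = Counter(rng.choices(ranks, weights=weights, k=n_words))
        ordered = sorted(counts.values(), reverse=True)
        for rank, count in enumerate(ordered):
            total[rank] += count
    return [t / n_samples for t in total]


def squared_deviation(observed, expected):
    """Sum of squared differences between two rank-ordered distributions,
    the quantity that is minimized in eq. (29)."""
    m = max(len(observed), len(expected))
    obs = list(observed) + [0.0] * (m - len(observed))
    exp_ = list(expected) + [0.0] * (m - len(expected))
    return sum((o - e) ** 2 for o, e in zip(obs, exp_))


rng = random.Random(0)   # fixed seed so that the sketch is reproducible
weights = [fermi_plateau(j, z0=1.0, j0=50, b=0.3) for j in range(1, 101)]
expected = expected_rank_distribution(weights, n_words=500, n_samples=20, rng=rng)
print(expected[:5])
```

In a complete fit the parameters of the guessed distribution would be varied by an optimizer until `squared_deviation` between the observed and the expected rank-ordered distributions is sufficiently small.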
IV. ANALYSIS OF ENTROPY SCALING

As shown above, the set of entropies of subwords is the
basic measure for the investigation of the order of a
string. The probabilities one needs for the calculation of
the entropies are in general unknown and have to be
estimated from the frequencies of words. The theoretical
value of the block entropy follows from the distribution
discussed in the last section. For simplicity we will omit
in the following the block length n. The theoretical
entropy is then

$$H = -\sum_{i} \frac{Z(i)}{N} \log\frac{Z(i)}{N} . \qquad (30)$$



The calculation of the block entropy from the distribution
function Z(i) is rather difficult and time consuming. The
direct observation of the entropy is based on the
frequencies of the different words. Replacing the
probabilities by the observed frequencies $N_i/N$ in the
entropy definition we obtain the observed entropy

$$H^{obs} = \log(N) - \frac{1}{N} \sum_{i} N_i \cdot \log(N_i) . \qquad (31)$$

This is a random variable with the expectation value

$$H^{exp} = \langle H^{obs} \rangle = \log(N) - \frac{1}{N} \sum_{i} \langle N_i \cdot \log(N_i) \rangle . \qquad (32)$$
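The observed entropy can be evaluated directly from the word counts. The short Python sketch below, with hypothetical function names and natural logarithms (no normalization by the alphabet size), checks that the count-based form log N minus the average of N_i log N_i agrees with the usual definition in which the probabilities are replaced by the relative frequencies.

```python
import math
from collections import Counter


def observed_entropy(counts):
    """Observed entropy from word counts N_i, eq. (31), in nats."""
    n_total = sum(counts)
    return math.log(n_total) - sum(c * math.log(c) for c in counts) / n_total


def plug_in_entropy(counts):
    """Entropy definition with probabilities replaced by N_i / N, in nats."""
    n_total = sum(counts)
    return -sum((c / n_total) * math.log(c / n_total) for c in counts)


text = "abracadabra"
counts = list(Counter(text[i:i + 2] for i in range(len(text) - 1)).values())
print(observed_entropy(counts), plug_in_entropy(counts))
```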



Let us assume now that the subwords are Bernoulli
distributed. Then we find (omitting further the index n
which characterizes the length of the subwords) for the
mean value

$$\langle H^{obs} \rangle = \log N - \sum_{i} \sum_{N_i = 1}^{N} \frac{(N-1)!}{(N_i - 1)! \, (N - N_i)!} \cdot p_i^{N_i} \cdot (1 - p_i)^{N - N_i} \cdot \log N_i . \qquad (33)$$



We will show now that the coefficients in this expansion
are closely related to the Rényi entropies of order q,
which are defined by

$$H^{(q)} = \frac{1}{1 - q} \log \sum_{i} p_i^{q} . \qquad (34)$$

The Shannon entropies correspond to the limit $q \to 1$ and
the case q = 0 is related to the total number of different
subwords s:

$$H = H^{(1)} , \qquad s = \lambda^{H^{(0)}} . \qquad (35)$$



In the limit of very long strings, i.e. $N \gg s$, the
expected entropies are given by the approximation [14, 21]

$$\langle H^{obs} \rangle = H - \frac{s}{2 N \log\lambda} . \qquad (36)$$



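In the opposite case, where $N \ll s$, the subwords may
appear only very few times [14]. Therefore we obtain the
series

$$\langle H^{obs} \rangle = \log N - N \cdot \log 2 \cdot \sum_{i} p_i^{2} - \frac{N^2}{2} \cdot (\log 3 - 2 \cdot \log 2) \cdot \sum_{i} p_i^{3} - \ldots \qquad (37)$$

Using the definition of the Rényi entropies it follows that

$$\langle H^{obs} \rangle = \log N - N \cdot \log 2 \cdot \lambda^{-H^{(2)}} - \frac{N^2}{2} \cdot (\log 3 - 2 \cdot \log 2) \cdot \lambda^{-2 H^{(3)}} - \ldots \qquad (38)$$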



The higher order Rényi entropies can be found by fitting
the coefficients of $N^k$ in this series to the observed
entropies. Since the entropies decrease monotonically with
their order, it follows that the second order entropies
(q = 2), which can be obtained from eq. (38), represent a
lower bound of the first order Shannon entropies (q = 1).
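The ordering of the Rényi entropies that underlies this lower bound is easy to verify numerically. The following Python sketch uses the standard definition of the Rényi entropy of order q with natural logarithms; the function name and the example distribution are illustrative assumptions.

```python
import math


def renyi_entropy(probs, q):
    """Rényi entropy of order q in nats; q = 1 is treated as the Shannon limit."""
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1.0 - q)


probs = [0.5, 0.25, 0.125, 0.125]   # illustrative distribution
h0, h1, h2 = (renyi_entropy(probs, q) for q in (0, 1, 2))
print(h0, h1, h2)   # non-increasing with the order q
```

For an equidistributed set of words all orders coincide, so the bound becomes sharp exactly in the rectangular limit of the asymptotic equipartition theorem.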

In this way we have now obtained three procedures to derive
the entropies of human writings with a given length L:

1. Calculation of the usual entropy (q = 1) from the
observed distribution. Estimation of some finite length
corrections using eq. (36). This is the standard procedure.

2. Guessing first the "true" distribution Z(i, n) on the
basis of the observed distribution $Z^{obs}(i, n, L)$ and
then calculating the entropies (for any q) from it. This
method is essentially new. It was tested here but it still
needs further elaboration.

3. Calculation of the higher order entropies (especially
q = 2) from the deviations between log N and the observed
entropies. This method, which seems to be new too, is of
limited accuracy and yields only rough estimates for the
second and higher order entropies.
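The third procedure can be illustrated for a case where the answer is known. The Python sketch below is a rough illustration under our own assumptions and not the estimation program used for the data: it computes the expected deviation between log N and the observed entropy for Bernoulli distributed words (the binomial sum of eq. (33), truncated at words seen at most `k_max` times), and then inverts only the leading term of the series, deviation ≈ N · log 2 · exp(−H2), for the second order entropy H2 in nats. All function names and the test distribution are hypothetical.

```python
import math


def entropy_deficit(probs, n_total, k_max=8):
    """Expected value of log N - H_obs for Bernoulli distributed words,
    truncated at words that occur at most k_max times (nats)."""
    deficit = 0.0
    for p in probs:
        for k in range(2, min(k_max, n_total) + 1):
            weight = (math.comb(n_total - 1, k - 1)
                      * p ** k * (1.0 - p) ** (n_total - k))
            deficit += weight * math.log(k)
    return deficit


def second_order_entropy_estimate(deficit, n_total):
    """Invert the leading term deficit ~ N * log 2 * exp(-H2) for H2 (nats)."""
    return -math.log(deficit / (n_total * math.log(2)))


# Test case with N << s: s equiprobable words, so that H2 = log(s) exactly.
s, n_total = 20000, 100
probs = [1.0 / s] * s
h2_est = second_order_entropy_estimate(entropy_deficit(probs, n_total), n_total)
print(h2_est, math.log(s))
```

The small remaining difference comes from the neglected higher terms of the series, which is consistent with the statement that this method yields only rough estimates.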

Let us consider now our examples: Beethoven's Sonata is
rather short (L = 4920). Therefore the first approach is
restricted to n < 7 for the 3-symbol alphabet and to n < 3
for the 32-symbol alphabet. We have obtained here as a new
result the entropy for n = 8 using method (2). The result
is

$$H_8/\log(3) = 6.39 \qquad \text{3 symbols} \qquad (39)$$
$$H_8/\log(32) = 2.49 \qquad \text{32 symbols} . \qquad (40)$$

For 9 < n < 26 the second order entropies were estimated by
means of method (3). The result can be approximated by the
fit formula

$$H_n/\log 3 \approx 2.04 \cdot (\log n + 1) , \qquad (\lambda = 3) . \qquad (41)$$
This fit formula is reminiscent of the scaling of the type
of eq. (18). The logarithmic function in eq. (41) yields a
better fit than the power law postulated in an earlier
paper.

Due to the rather large length of the text Moby Dick we may
apply method (1) up to much larger n-values. Taking into
account that not all of the combinatorial possibilities are
admitted, we went up to n = 16 for the 3-letter alphabet
and up to n = 10 for the 32-letter alphabet. The results
are given in Table II. We have also checked these values by
a consistency test with method (3). A more detailed
calculation by means of the distribution function method
(2) will be discussed elsewhere. For larger n-values the
second order entropies (q = 2) were estimated from the
differences between log N and the observed entropies
(Fig. 8) using eq. (38). The result of our estimate is
given by the fit formulae

$$H_n/\log(3) = 4.8 \cdot \sqrt{n} - 7.6 \qquad (\lambda = 3) \qquad (42)$$
$$H_n/\log(32) = 0.9 \cdot \sqrt{n} + 1.7 \qquad (\lambda = 32) . \qquad (43)$$

These fit formulae are reminiscent of a scaling law
according to eq. (19) with the exponent 0.5. The low
accuracy of the data does not allow for a quantitative fit
of the exponent; in fact any value between 0.4 and 0.6
gives a reasonable fit. A scaling law of square root type
was first found by Hilberg, who fitted Shannon's original
data. This result was reproduced also for a textbook on
self-organization.
FIG. 8. The measured entropy values for Moby Dick
over the length of the text investigated for different word
length n (horizontal axis: text length N).

Let us consider now the growth of the number of different
subwords. According to relation (20) the number of
different subwords in Moby Dick grows with n according to a
stretched exponential law

$$N^{*} \approx 4.1 \cdot 10^{-4} \cdot \exp(5.2 \cdot \sqrt{n}) . \qquad (44)$$

For the sonata the number of different subwords does not
increase as fast as for the English text; we find here a
quadratic growth law

$$N^{*} \approx 7.6 \cdot n^{2} . \qquad (45)$$
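The size of the reduction implied by the two growth laws can be made explicit by evaluating them next to the full combinatorial variety. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes that both fitted laws refer to the 3-letter alphabet (as the coefficients of the corresponding entropy fits suggest), so that the comparison is made with 3 to the power n. The fits should not be extrapolated to small n.

```python
import math


def moby_dick_subwords(n: float) -> float:
    """Stretched exponential growth law fitted for Moby Dick, eq. (44)."""
    return 4.1e-4 * math.exp(5.2 * math.sqrt(n))


def sonata_subwords(n: float) -> float:
    """Quadratic growth law fitted for the Beethoven sonata, eq. (45)."""
    return 7.6 * n ** 2


for n in (9, 16, 25):
    # Columns: n, all possible words, text estimate, sonata estimate.
    print(n, 3 ** n, moby_dick_subwords(n), sonata_subwords(n))
```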



V. CONCLUSIONS 

The present paper reports on results concerning
information-carrying strings such as texts and pieces of
music. The results show that block entropies and the mutual
information are appropriate measures of the correlations
and the degree of order in strings. In agreement with
earlier work we have confirmed the hypothesis that strings
produced by information processing sources are neither
periodic nor chaotic but somehow in between. In the present
work we have studied several examples (two books and one
piece of music) to illustrate this hypothesis.

We have shown that there is some empirical evidence for
short range and middle range correlations. This is to be
seen in the statistics of pairs of letters and subwords.
Further relevant information is contained in the block
entropies. We developed methods to guess these unknown
quantities from relatively short samples. For subword
lengths $n \gg 100$ we did not find any indication for the
existence of pair or word correlations. Our empirical
studies of the pair correlation function and the spectrum
derived from this quantity show a strong sensitivity to the
length of the texts. In spite of all these difficulties we
may state in conclusion that several distinct differences
from chaotic strings are observed in texts or pieces of
music.

The entropy scaling shows that texts and pieces of music
resemble the sequences created by nonlinear dynamic systems
near critical points through symbolic dynamics. However,
more empirical data on long texts and pieces of music and a
more detailed study of the block entropies of dynamical
systems are needed to elaborate this point further. We
state that with the methods presently available it is not
possible to calculate high order entropies for n > 30 since
there are no homogeneous texts which are significantly
longer than several million letters.



H_n / log λ    n = 6    8       10      12      14      16
λ = 3          4.85     6.30    7.72    9.10    10.5    11.54
λ = 32         3.21     3.92    4.45

TABLE II. The calculated values of the entropies for Moby
Dick.